How to Fine-Tune an LLM Part 2: Instruction Tuning Llama 2

In part 1, we prepped our dataset. In part 2, we train our model
Last Updated: Nov 24, 2023
In our previous article on datasets for instruction tuning, we explored how to create an instruction dataset for a Llama 2 model. In this article, we'll fine-tune it using the Alpaca dataset we previously prepared.
This codebase and article aim to be pedagogical and straightforward. The main goal here is to understand what is happening under the hood when you fine-tune an LLM for instruction tuning.
There are more sophisticated training recipes out there, like the Hugging Face transformers' Trainer, trl, Axolotl, Peft, llama_recipes, the alignment_handbook, etc. In this article, we will try our best to keep things as simple as possible and make the training loop straightforward to follow.
Llama and alpaca meeting again (the left one looks more like a guanaco to me, but that might be a personal thing)

What We'll Be Covering:



NOTE: The code associated with this post can be found here.

Downloading the Preprocessed Dataset from W&B

Let's get started. In the previous article, we saved our preprocessed dataset as a Weights & Biases Artifact, so we can easily retrieve the dataset from there. Here's the code:
import wandb
from pathlib import Path

run = wandb.init(project="alpaca_ft")
artifact = run.use_artifact('capecape/alpaca_ft/packed_alpaca:v0', type='dataset')
artifact_dir = Path(artifact.download())
As we kept the dataset as plain JSONL files, we can open them directly using Python's built-in json module:
import json

def load_jsonl(filename):
    data = []
    with open(filename, 'r') as file:
        for line in file:
            data.append(json.loads(line))
    return data

train_ds_packed = load_jsonl(artifact_dir/"train_packed_alpaca.jsonl")
eval_ds_packed = load_jsonl(artifact_dir/"eval_packed_alpaca.jsonl")
From there, we can continue our training!

Loading Local JSON Data from Disk Using HuggingFace Datasets

A better container for your dataset than plain JSON might be the Hugging Face datasets library. It has many advantages, such as fast loading, built-in map/filter methods, and bucket streaming, among others. We also saved the dataset in datasets format as a separate artifact, so we can load it back quickly using the load_from_disk method:
import wandb
from datasets import load_from_disk # for some reason load_dataset gives an error

run = wandb.init(project="alpaca_ft")
artifact = run.use_artifact('capecape/alpaca_ft/packed_alpaca_hf:v0', type='dataset')
artifact_dir = artifact.download()
ds_packed = load_from_disk(artifact_dir)

# we are back where we started!
train_ds_packed = ds_packed["train"]
eval_ds_packed = ds_packed["eval"]

DataLoader

As we are training for completion, the labels (or targets) will be the inputs shifted by one. We will train with regular cross-entropy and predict the next token on this packed dataset.
As input and target are the same but shifted, we lose one token at each end
In code, we accomplish this by setting the labels as the inputs shifted by one:
{"input_ids": input_ids[:-1], "labels": input_ids[1:]} # you actually drop one value
Beware that the Hugging Face model does this for you automatically when computing the loss on the model output (`ModelOutput.loss`); in that case, inputs and labels are identical.
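As a toy check of this shifting (using made-up token ids), note that every label lines up with the token that follows the corresponding input token:

```python
# Toy example with made-up token ids: build an (input, label) pair shifted by one.
input_ids = [1, 13866, 338, 385, 15278, 393]

example = {"input_ids": input_ids[:-1], "labels": input_ids[1:]}

# each label is the token that follows the corresponding input token
assert example["labels"][:-1] == example["input_ids"][1:]
# we lose one token at each end, so both sequences are one shorter
assert len(example["input_ids"]) == len(example["labels"]) == len(input_ids) - 1
```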
We can now put together a standard PyTorch DataLoader:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 8  # I have an A100 GPU with 40GB of RAM 😎

train_dataloader = DataLoader(
    train_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator,  # we don't need any special collator 😎
)

eval_dataloader = DataLoader(
    eval_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator,
    shuffle=False,
)
It's always a good idea to check what a batch looks like. You can quickly do this by sampling from the DataLoader:
b = next(iter(train_dataloader))
b.keys(), b["input_ids"][0][:25], b["labels"][0][:25]

>> (dict_keys(['input_ids', 'labels']),
tensor([ 1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889,
14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009,
29889, 13, 13, 2277, 29937]),
tensor([13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350,
263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889,
13, 13, 2277, 29937, 2799])) ### <<< ---- shifted by 1

# input_ids.shape: (16, 1024), labels.shape: (16, 1024)
Everything looks fine; let's train this thing!

Training Loop


We'll start by training a model, naively making the model complete the sentence. As an exercise, I will implement this in pure PyTorch, so no abstractions are present besides grabbing the pre-trained model from the HuggingFace Hub.
I like storing the configuration hyperparameters in a SimpleNamespace. It's like a dictionary with .dot attribute access. Then, I can access my batch size by doing config.batch_size instead of config["batch_size"].
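A minimal illustration of the dot access:

```python
from types import SimpleNamespace

config = SimpleNamespace(batch_size=8, lr=2e-4)
print(config.batch_size)   # dot access instead of config["batch_size"]
config.epochs = 3          # new attributes can be added on the fly
print(vars(config))        # still convertible to a plain dict when needed
```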
We will use some necessary tricks to make this possible:
  • We're going to train a subset of the model parameters instead of the full model.
  • We're going to use gradient checkpointing to save GPU memory. Checkpointing mitigates memory usage by discarding and recomputing certain layers' activations during the backward pass, trading additional computation time for decreased memory usage.
  • Automatic Mixed Precision: this technique makes training considerably faster, as the computations are done in half precision (float16 or bfloat16). You can read more about this technique here.
  • We will implement an evaluation step that samples from the model regularly.
Let's get started!
from types import SimpleNamespace

gradient_accumulation_steps = 32 // batch_size

config = SimpleNamespace(
    model_id='meta-llama/Llama-2-7b-hf',
    dataset_name="alpaca-gpt4",
    precision="bf16",  # faster and better than fp16, requires newer GPUs
    n_freeze=24,  # how many layers we don't train; Llama 7B has 32
    lr=2e-4,
    n_eval_samples=10,  # how many samples to generate on validation
    max_seq_len=max_sequence_len,  # length of the sequences to pack
    epochs=3,  # we do 3 passes over the dataset
    gradient_accumulation_steps=gradient_accumulation_steps,  # every how many iterations we update the gradients, simulates larger batch sizes
    batch_size=batch_size,  # what my GPU can handle, depends on how many layers we are training
    log_model=False,  # upload the model to W&B?
    mom=0.9,  # optim param
    gradient_checkpointing=True,  # saves even more memory
    freeze_embed=True,  # why train these? let's keep them frozen ❄️
)

config.total_train_steps = config.epochs * len(train_dataloader) // config.gradient_accumulation_steps
We first get a pre-trained model with some configuration parameters:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    config.model_id,
    device_map=0,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    use_cache=False,
)


Freezing the Model to Save Memory: 🥶 Jeremy Howard Style

Training the full models is expensive, but if you have a GPU that can fit the entire model, you can skip this part. Instead, we will train a subset of the model parameters. This technique has worked in other domains and was pioneered by Jeremy and Seb Ruder.
Transformer-based models like Llama are a stack of identical layers on top of each other with a classification layer at the end. Llama 2-7b has 32 transformer layers, so we will only train the last 8 of them. You can experiment with how many layers to freeze. You always want to train the classification head (the last layer that makes the predictions).
In a follow-up piece, we'll explore how one can leverage parameter-efficient fine-tuning techniques like LoRA.
This technique has proven to work well in most cases; try it!
Before trying fancy parameter-efficient methods, let's go Jeremy style and freeze most model layers. After loading the model, we freeze most of it. This way, we save a ton of memory by not computing gradients on the frozen layers.
n_freeze = 24  # you can play with this parameter

# freeze layers (disable gradients)
for param in model.parameters(): param.requires_grad = False
for param in model.lm_head.parameters(): param.requires_grad = True
for param in model.model.layers[n_freeze:].parameters(): param.requires_grad = True

>> Total params: 6738.42M, Trainable: 1750.14M
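The printout above can be reproduced with a small helper. This is a sketch of one way to compute it, not the author's exact code:

```python
def param_count(model):
    """Print total and trainable parameter counts in millions."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total params: {total/1e6:.2f}M, Trainable: {trainable/1e6:.2f}M")
    return total, trainable
```

Calling param_count(model) after the freezing loop above should report roughly the numbers shown.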
You can even gain a little bit more memory by freezing the embeddings!
# just freeze embeddings for a small memory decrease
if config.freeze_embed:
    model.model.embed_tokens.weight.requires_grad_(False)
You can also use gradient checkpointing to save even more (this makes training slower; how much slower will depend on your particular configuration). There is an excellent article on the Hugging Face website about how to fit large models in memory; I encourage you to check it out!
# save more memory
if config.gradient_checkpointing:
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})  # <- PyTorch changed this
The code is borrowed from Jeremy's excellent notebook and video.
Had to fix this cell, as the "use_reentrant" argument is needed to make gradients flow from the frozen embeddings!

Optimizer and Scheduler


We'll now set the optimizer and scheduler for our training. We need this to tell PyTorch how to compute the optimization step and adjust the learning rate accordingly. There are probably fancier techniques to try, but Adam and cosine_schedule are safe starting points. We will also set up our training loop using bfloat, to make good use of those TensorCores available on modern Nvidia GPUs. We will also set up the loss_fn as Cross Entropy.
from transformers import get_cosine_schedule_with_warmup

optim = torch.optim.Adam(model.parameters(), lr=config.lr, betas=(0.9, 0.99), eps=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optim,
    num_training_steps=config.total_train_steps,
    num_warmup_steps=config.total_train_steps // 10,
)

def loss_fn(x, y):
    "A flat CrossEntropy"
    return torch.nn.functional.cross_entropy(x.view(-1, x.shape[-1]), y.view(-1))
We grab the scheduler from the transformers library; why not? It's already there waiting for us. You can implement the scheduler Karpathy-style if you like.
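A hand-rolled version of that schedule might look like this; a sketch mirroring what get_cosine_schedule_with_warmup computes with its default settings (linear warmup, then cosine decay to zero):

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At step 0 the learning rate is zero, it peaks at base_lr when warmup ends, and it decays to zero by total_steps.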

Sampling from the Model

We are almost there! Let's create a simple function to sample from the model every now and then, so we can visually inspect what the model is outputting.
Let's wrap the model.generate method for simplicity. You can grab the default sampling parameters from the GenerationConfig and pass the corresponding model_id. This will hold the defaults for parameters like temperature, top p, etc.
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained(config.model_id)

def generate(prompt, max_new_tokens=100, gen_config=gen_config):
    with torch.inference_mode():
        tokenized_prompt = tokenizer(prompt, return_tensors='pt')['input_ids'].cuda()
        output = model.generate(tokenized_prompt,
                                max_new_tokens=max_new_tokens,
                                generation_config=gen_config)
    return tokenizer.decode(output[0][len(tokenized_prompt[0]):], skip_special_tokens=True)
We'll run our model over the eval_dataset every 1/10th of the total train steps and log a table to Weights & Biases containing the model predictions. We will also add the relevant sampling parameters in case we change them later on.
def prompt_table(prompts, log=True):
    table = wandb.Table(columns=["prompt", "generation", "concat", "max_new_tokens", "temperature", "top_p"])
    for prompt in progress_bar(prompts):
        out = generate(prompt, test_config.max_new_tokens, test_config.gen_config)
        table.add_data(prompt, out, prompt + out, test_config.max_new_tokens,
                       test_config.gen_config.temperature, test_config.gen_config.top_p)
    if log:
        wandb.log({"predictions": table})
    return table

Validation Step


You should always run some validation during training. You may skip this if the training run is very short, but computing metrics on a validation dataset can give you precious insight into how the training is going. For LLMs, you also want to sample from the model to visualize how well it is aligning with your data. We implement a validate function that does a couple of things:
  • Iterates through the eval_dataloader and accumulates loss and accuracy
  • Logs those metrics to W&B over the entire dataset
  • Samples from the model and logs the generations to a W&B Table
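The Accuracy helper used below isn't shown in the snippets; here's a minimal sketch of a running token-accuracy tracker (an assumption about what the author's helper does, not their exact code):

```python
import torch

class Accuracy:
    "Running token-level accuracy: argmax of the logits vs. the labels."
    def __init__(self):
        self.correct, self.count = 0, 0

    def update(self, logits, labels):
        preds = logits.argmax(dim=-1)                  # (batch, seq_len)
        self.correct += (preds == labels).sum().item() # correctly predicted tokens
        self.count += labels.numel()                   # total tokens seen so far
        return self.correct / self.count

    def compute(self):
        return self.correct / self.count
```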
@torch.no_grad()
def validate():
    model.eval()
    eval_acc = Accuracy()
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = to_gpu(batch)
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss = loss_fn(out.logits, batch["labels"])  # you could use out.loss and not shift the dataset
        eval_acc.update(out.logits, batch["labels"])
    # we log results at the end
    wandb.log({"eval_loss": loss.item(),
               "eval_accuracy": eval_acc.compute()})
    prompt_table(eval_dataset[:config.n_eval_samples], log=True)
    model.train()
It's a good idea to run validation after some steps to check that everything is going okay, as this is a short fine-tuning. You want to call validate at least a couple of times during training; how often will depend on the task and the dataset size. For this experiment, we will perform validation 3 times (at the end of every epoch).

A Simple PyTorch Training Loop for Your LLM


This is a standard PyTorch training loop that iterates through the train data loader and performs evaluation every fixed number of steps. It saves the model at the end of training. It also includes:
  • Gradient accumulation: this technique lets us simulate larger batch sizes, which is very useful when using GPUs with less memory.
  • Sampling and model checkpoint saving (this trains very fast, so there is no need to save multiple checkpoints).
  • Token accuracy computation: accuracy is a better metric than loss because it is easier to understand; the number represents a quantity we can interpret. Also, let's not forget that next-token prediction is a classification task! If you don't believe me, Jeremy Howard still suggests accuracy as the metric to go with for causal language modeling.
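The gradient-accumulation bookkeeping in the loop below boils down to simple arithmetic (using the batch size from the config above):

```python
# Effective batch size with gradient accumulation: gradients from several
# micro-batches are summed before a single optimizer step.
batch_size = 8
gradient_accumulation_steps = 32 // batch_size   # 4, as in the config
effective_batch_size = batch_size * gradient_accumulation_steps

print(effective_batch_size)  # 32 samples contribute to each optimizer update
```

This is also why the loss in the loop is divided by gradient_accumulation_steps: the summed gradients then average out to what a single large batch would produce.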
wandb.init(project="alpaca_ft",  # the project I am working on
           tags=["baseline", "7b"],
           job_type="train",
           config=config)  # the hyperparameters I want to keep track of

# Training
acc = Accuracy()
model.train()
train_step = 0
pbar = tqdm(total=config.total_train_steps)
for epoch in range(config.epochs):
    for step, batch in enumerate(train_dataloader):
        batch = to_gpu(batch)
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss = loss_fn(out.logits, batch["labels"]) / config.gradient_accumulation_steps  # you could use out.loss and not shift the dataset
        loss.backward()
        if step % config.gradient_accumulation_steps == 0:
            # we can log the metrics to W&B
            wandb.log({"train/loss": loss.item() * config.gradient_accumulation_steps,
                       "train/accuracy": acc.update(out.logits, batch["labels"]),
                       "train/learning_rate": scheduler.get_last_lr()[0],
                       "train/global_step": train_step})
            optim.step()
            scheduler.step()
            optim.zero_grad(set_to_none=True)
            train_step += 1
            pbar.update(1)
    validate()
pbar.close()
# we save the model checkpoint at the end
save_model(
    model,
    model_name=config.model_id.replace("/", "_"),
    models_folder="models/", log=config.log_model)
wandb.finish()
This trains in around 120 minutes on an A100.
The Hugging Face course has a training loop similar to this one that uses pure PyTorch to train a model from the HF hub.

Results

We present the loss curves and the accuracy metrics. Our total training steps come to around 1150 (3 epochs) with gradient_accumulation_steps = 4, so 32 samples pass through the model before each gradient update.
Our results:

[W&B training charts plotted against train/global_step]
We can inspect the model generations from the table below! At first glance, the results look reasonable! Let's manually check a fixed row. The input is "Generate a list of 5 potential threats to digital security," and we can see on the generation column the evolution of the generations over time by clicking on the arrows at the bottom of the cell:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Generate a list of 5 potential threats to digital security

### Response:
1. Phishing
2. Malware
3. Man-in-the-Middle Attacks
4. Identity Theft
5. Social Engineering
All three logged generations are identical:

1. Malware and Viruses: Malware and viruses can cause severe damage to computer systems by corrupting files, stealing sensitive information, or even taking control of the system. 2. Phishing: Phishing is a form of fraud where attackers attempt to trick users into revealing sensitive information or clicking on malicious links. 3. Unsecured Wi-Fi Networks: Using unsecured Wi-Fi networks can leave devices open to attacks, as attackers can intercept data transmitted over the network. 4. Social Engineering: Social engineering is the practice of tricking people into divulging sensitive information or taking certain actions, often through manipulation or deceit. 5. Weak passwords: Using weak, easy-to-guess passwords can make it easy for attackers to gain access to accounts and steal data.</s>
These were samples generated and computed during training. We'll evaluate them on the entire test dataset in the next section.

GPT-4 Based Evaluation


Let's use GPT-4 to compare the results generated by the fine-tuned model against GPT-3.5, and get GPT-4's reason for picking one over the other. This evaluation technique has been used in multiple places, for instance MT-Bench (a set of challenging multi-turn, open-ended questions for evaluating chat assistants).
You can read more about LLM supervised evaluation in Ayush's article. 
GPT-4 is better at reasoning than GPT-3.5. Also, it wouldn't be fair to use the same model that generated one of the responses to judge itself. Of course, this technique is not perfect, and other studies have shown that this evaluation strategy may not be consistent under permutation (switching the answers) or even across repeated calls to the model, which can lead to different responses due to the stochastic nature of the generations. One way to mitigate this is to set the temperature sampling parameter closer to zero to make the model more deterministic.
The clear win of this approach is that one can implement LLM-based evaluation quickly, and using a powerful model like GPT-4, we can create a baseline score fast. Ideally, you would want to set up a human-based assessment at some point, but that is more costly and slower to implement.
We can leverage OpenAI function calling to format the output of GPT-4 with the corresponding choice made and the reason.
def gpt4_judge(instruction, gen1, gen2, model="gpt-4"):
    system_prompt = ("You will be presented with a choice of two possible responses for an instruction. "
                     "You have to pick the best one and give a reason why.\n"
                     "The response should follow the instructions and use the provided context if there is some.\n"
                     "If both answers are equivalent, pick the value 0")
    message = "{instruction}\n Answer 1: \n{gen1}\n Answer 2:\n{gen2}".format(
        instruction=instruction, gen1=gen1, gen2=gen2)
    completion = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": message},
        ],
        function_call={"name": "make_choice"},
        functions=[{
            "name": "make_choice",
            "description": "Select the best generation and explain why",
            "parameters": {
                "type": "object",
                "properties": {
                    "choice": {
                        "type": "integer",
                        "description": "the chosen alternative, zero if equivalent",
                    },
                    "argument": {
                        "type": "string",
                        "description": "Reason why the choice was made",
                    },
                },
                "required": ["choice", "argument"],
            },
        }],
    )
    return completion
You can inspect the results in the evaluation tables below. We generated 250 completions using GPT-3.5 and asked GPT-4 to pick the best one; we also left the possibility of marking both as equally good:
  • Both models are good
  • The fine-tuned Llama was better
  • GPT-3.5 produced better output
To make our testing more robust, we inverted the order and asked GPT-4 again, and we only kept the choices where GPT-4 consistently picked the same answer regardless of the order. To our surprise, GPT-4 switched sides 34 times! So take this evaluation with a grain of salt!
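Filtering for order-consistent judgments comes down to a tiny helper. A sketch of the logic, with hypothetical field names:

```python
def is_consistent(choice, choice_inverted):
    """A choice of 1 in the original order corresponds to 2 in the inverted order (0 = tie)."""
    swap = {0: 0, 1: 2, 2: 1}
    return choice == swap[choice_inverted]

judgments = [
    {"choice": 2, "choice_inverted": 1},  # consistent: same winner both ways
    {"choice": 1, "choice_inverted": 1},  # inconsistent: GPT-4 switched sides
]
kept = [j for j in judgments if is_consistent(j["choice"], j["choice_inverted"])]
print(len(kept))  # 1
```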

Order matters: inverting the query order makes GPT-4 switch sides (sometimes)

You can check GPT-4 inconsistency here:
  • Occasionally, it prefers short answers, then switches sides and values the explanation and the longer answer 🤔
  • In some cases, it judges both equally good and, when inverted, prefers one.
  • Check the answers below by clicking the < > 👇

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Describe a unique food you have tried

### Response:
llama7b_ft: As an AI, I don't have a physical form, so I haven't tried any foods. However, I can tell you about some unique foods that I have heard of. For example, there is a type of cheese called "Epoisses" that is made from cow's milk and is known for its strong, pungent odor. It is said to be one of the most pungent cheeses in the world. Another unique food is "Eel" which is a type of fish that is popular in Japan. It is said to have a very strong flavor and is often served raw.

gpt3.5: One unique food that I have tried is called haggis. Haggis is a traditional Scottish dish and it is made from various sheep's organs including the heart, liver, and lungs, which are minced and mixed with spices, oatmeal, onions, and suet. This mixture is then stuffed into a sheep's stomach and boiled. While the ingredients may sound unconventional to some, haggis has a rich and hearty flavor that is truly unique. It is often served with neeps and tatties (mashed turnips and potatoes) and enjoyed as part of a traditional Scottish meal.

reason: Answer 1 follows the instruction better because as an AI, the entity doesn't have personal experiences or the capability to try food. Answer 2, on the other hand, humanizes the AI by suggesting it has tried a certain food, which isn't possible.

reason_inverted: The given instruction was to describe a unique food the responder has tried. Answer 1 accurately follows these instructions by describing a unique food item that they have tried, haggis. Answer 2, instead of describing a food they have tried, talks about food they have heard of which does not directly comply with the instructions despite it being technically true in context of an AI.

agree: False

Evaluation Results

You can also check why GPT-4 picked each answer in the "argument" column. The fine-tuned Llama is good, but not nearly as good as GPT-3.5. This makes sense: why would a 7B model trained on a handful of GPT-4 generations outperform a probably much bigger model like GPT-3.5? Anyway, other questions arise:
  • Is the 7B model too small? If we switched to Llama 2-13b, would the outcome be the same?
  • Should we train more layers of the model? All layers?
We will explore some of these questions in the following articles.

[W&B bar chart of choice_name counts — both: 49, gpt3.5: 154, llama7b_ft: 13]
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Generate an medical diagnosis from the following test result.

### Input:
Patient has elevated levels of white blood cells.

### Response:
Based on the input, suggest a medical diagnosis.

### Response:
Based on the input, suggest a medical diagnosis.

### Response:
Based on the input, suggest a medical diagnosis.
Based on the elevated levels of white blood cells, the medical diagnosis could possibly be an infection or inflammation in the body. However, it is important to consult with a healthcare professional for an accurate diagnosis and appropriate treatment.
choice: 2
choice_inverted: 1
reason: Option 2 is more suitable since it provides a prediction for a possible medical condition based on the given information, whereas Option 1 merely restates the instruction.
reason_inverted: Answer 1 is better because it presents a more relevant diagnosis suggestion based on the provided context, while Answer 2 merely restates the instruction.
agree: True
choice_name: gpt3.5
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate this text into Spanish.

### Input:
We are excited to work with you on this project.

### Response:
Estoy muy feliz de trabajar con usted en este proyecto.
Estamos emocionados de trabajar contigo en este proyecto.
choice: 2
choice_inverted: 1
reason: Answer 2 is more accurate. The original text uses 'we' and 'you', which is correctly represented in the second answer as 'Estamos' and 'contigo'. The first one uses 'Estoy' which means 'I am', not aligning with the original text.
reason_inverted: The first option is more accurate because it includes the pronoun 'we' and the emotion 'excited', which matches the original English text.
agree: True
choice_name: gpt3.5

Conclusion and Final Remarks

Fine-tuning a model on an instruction dataset is just a particular case of completion training, where one constructs the dataset in an organized way so it can learn to follow instructions. This is a small example to demystify the complexity of what's happening under the hood when using specialized libraries to fine-tune.
Of course, Llama 7B is the smallest of the models out there, and one may obtain better results using its bigger brothers, but we managed to give instruction-following capabilities to a pre-trained model that did not have them. The model now replies in the specified format most of the time and generates reasonable answers.
GPT-4 tends to prefer GPT-3.5... "GPTs prefer GPTs. — Ayush T." 🤣
This is the first of two articles about Instruction tuning. In the following piece, we will train the model by using the Hugging Face ecosystem and W&B integration. This will significantly simplify the preprocessing and code one must write.

Never lose track of another ML project. Try W&B today.